The graphics system in R is very flexible: just about every aspect of your plot can be precisely controlled. However, this brings with it a serious learning curve - especially when you want to produce high quality polished figures for publication. At this point, you should know how to make simple plots and to control different aspects of formatting plots using the basic built-in graphics system in R, known as the base graphics system, including a familiarity with many of the different graphical parameter (par) options. If this is not the case, we suggest that you start with chapter 4 in the HIE R manual (‘Data analysis and Visualisation with R’).
Here, our focus is on advanced topics, particularly (i) interacting with the graphics window to produce more complicated plot types and (ii) examples of useful plot types that can be produced with specific packages.
In this document we have examples using the base graphics system and the popular ggplot2 package. The advantage of ggplot2 is that complex graphs, containing multiple panels, grouping variables, and so on, can be constructed with very little code. The disadvantage is that if the user wishes to modify any (or all) aspect of the plot, it becomes more and more difficult. The disadvantage of the base graphics system is that we often have to write quite a lot of code to construct more complicated plots, but the advantage is that modifying layout, annotations, etc. is much more straightforward. Many R packages also provide plotting functions for the base graphics system, thus providing many options.
Legends are easy to produce, but it can be difficult to find space for the legend on some plots, especially if there is a lot of text in the legend. However, we can use the xpd argument in the par function to set up a plot that allows us to prepare a legend on the outside of the plot margins. For example,
# read in data and split data.frame by species
coweeta <- read.csv("coweeta.csv")
# set plot margins, with extra white space on right side
par(xpd = T, mar = c(5,4,1,7))
with(coweeta, plot(height ~ DBH, pch=c(1:10)[species]))
legend(70, 35, levels(coweeta$species), pch=1:10, cex=0.8)
There are multiple ways to set up a multi-panel plot, the simplest of which is to use the ‘mfrow’ (or ‘mfcol’) argument with the ‘par’ function.
# take a subset of the species
cowsub <- droplevels(subset(coweeta, species %in% c("acru","bele","caov","qupr")))
# split data.frame by species
coweeta.spl <- split(cowsub, cowsub$species)
# set up graphics window with three rows and two columns
# also modify margins so plots are not compressed
par(mfrow=c(2,2), mar=c(4, 4, 0, 0))
# produce plot for each species
for(i in 1:length(coweeta.spl)) {
with(coweeta.spl[[i]], plot(folmass ~ height))
legend('topleft', names(coweeta.spl)[i], bty='n')
}
Try this yourself: In the above example, use a different colour for each species. First set up a vector of colours, then index the vector in the plot statement.
For plots such as these, it would be nicer to only have consistent axes and one title on each of the x- and y-axes. This can be done easily using functions in the ggplot2 library.
# load library
library(ggplot2)
# produce plots, using facet_wrap to separate plots by species
ggplot(cowsub, aes(x=height, y=folmass)) +
geom_point() +
facet_wrap(~species, nrow=2) +
labs(x="Height (m)", y="Total foliage mass (kg)")
We can define more complicated layouts using the layout function. For example, lets look at an example with two rows of plots but with unequal numbers of plots on each row.
# generate data subset
cowsub <- droplevels(subset(coweeta, species %in% c("acru","bele","caov","qupr")))
# specify layout using a matrix
# numbers in the matrix elements refer to plot numbers in that section
l <- layout(matrix(c(1, 2, 3, 3), nrow=2, byrow=T))
# show plot layout
layout.show(l)
# set margins
par(mar=c(4, 4, 1, 1))
# produce the three plots
# plot 1, top left, with legend
with(cowsub, plot(height ~ DBH, col=species, pch=16))
legend('bottomright', legend=levels(cowsub$species), pch=16, col=palette()[1:4], bty='n')
# plot 2, top right
with(cowsub, plot(folmass ~ age, col=species, pch=16))
# plot 3, along bottom
with(cowsub, boxplot(SLA ~ species, ylab='SLA'))
We can also specify different panel dimensions using the widths and heights arguments for the layout function.
# generate data subset
cowsub <- droplevels(subset(coweeta, species %in% c("acru","bele","caov","qupr")))
# specify layout using a matrix (note change to 'byrow=F')
# numbers in the matrix elements refer to plot numbers in that section
l <- layout(matrix(1:6, nrow=3, byrow=F), widths=c(0.33, 0.67), heights=c(rep(1/3, 3)))
# show plot layout
layout.show(l)
# set margins
par(mar=c(4, 4, 1, 1))
# produce the plots
# three plots on left
with(cowsub, plot(height ~ DBH, col=species, pch=16))
legend('bottomright', legend=levels(cowsub$species), pch=16, col=palette()[1:4], bty='n')
with(cowsub, plot(folmass ~ DBH, col=species, pch=16))
with(cowsub, plot(SLA ~ DBH, col=species, pch=16))
# three plots on right (with reduced margins on left and no labels on y-axis)
par(mar=c(4, 0, 1, 1))
with(cowsub, boxplot(height ~ species, yaxt='n'))
axis(2, labels=F)
with(cowsub, boxplot(folmass ~ species, yaxt='n'))
axis(2, labels=F)
with(cowsub, boxplot(SLA ~ species, yaxt='n'))
axis(2, labels=F)
split.screen for complicated layoutsSome panel orientations are too difficult to set up using ‘layout’. In these cases, ‘split.screen’ can be very useful.
# split display into two screens, one for each row
split.screen(c(2, 1))
## [1] 1 2
# set margins
par(mar=c(4,4,1,1))
# split screen 1 further into three columns (screens 3, 4, and 5)
split.screen(c(1, 3), screen=1)
## [1] 3 4 5
# screen 3, left
screen(3)
with(cowsub, plot(height ~ DBH, col=species, pch=16))
legend('bottomright', legend=levels(cowsub$species), pch=16, col=palette()[1:4], bty='n')
# screen 4, middle
screen(4)
with(cowsub, plot(folmass ~ DBH, col=species, pch=16))
# screen 5, right
screen(5)
with(cowsub, plot(SLA ~ DBH, col=species, pch=16))
# split screen 2 further into two colums (screens 6 and 7)
split.screen(c(1, 2), screen=2)
## [1] 6 7
# screen 6, left
screen(6)
with(cowsub, boxplot(age ~ species, ylab='age'))
# screen 7, right
screen(7)
with(cowsub, boxplot(elev ~ species, ylab='elevation'))
# exit split-screen mode
close.screen(all=T)
We can insert these graphics directly into the plot after reading the image using functions from an appropriate library (e.g., jpeg, tiff, png) and rendering the image on top of the plot that we have produced. For example, when visualising the distribution of spore pigmentation among fungal species, it would be helpful to also include images that represent extreme values. Here we use the ‘rasterImage’ function from the ‘graphics’ library, which is loaded when starting R. (Note that it can be tricky to set the image size correctly so that the image isn’t distorted).
# load library
library(jpeg)
# read in spore colour data
spores <- read.csv('sporecolour.csv')
# calculate spore 'blackness' for each row
spores$K <- 1 - apply(spores[, c('red', 'green', 'blue')], 1, max)/255
# read in three images representing three levels of blackness
caledonius <- readJPEG('Funneliformis_caledonius_immature.jpg')
erythropa <- readJPEG('Dentiscutata_erythropa_1.jpg')
constrictum <- readJPEG('Septoglomus_constrictum_1.jpg')
# set plot layout
layout(matrix(c(rep(1, 9), 2, 3, 4), nrow=4, byrow=T))
# plot histogram of spore 'blackness'
par(mar=c(5, 5, 1, 1))
hist(spores$K, main='', xlab='spore blackness (K)', cex.axis=1.5, cex.lab=2,
breaks=10, col=grey(seq(from=1, to=0, length.out=10)))
# plot and raster images in bottom panels
# set margins
par(mar=c(2, 1, 1, 1))
# plot and raster image on left
plot(c(0, 1), c(0, 1), type='n', axes=F, xlab='', ylab='')
axis(1, at=0.5, labels='F. caledonius (K = 0.05)', tick=F, line=-1, cex.axis=1.5)
rasterImage(caledonius, 0, 0, 1, 1) # modify image size by specifying different coordinates
# plot and raster middle image
plot(c(0, 1), c(0, 1), type='n', axes=F, xlab='', ylab='')
axis(1, at=0.5, labels='D. erythropa (K = 0.49)', tick=F, line=-1, cex.axis=1.5)
rasterImage(erythropa, 0, 0, 1, 1) # modify image size by specifying different coordinates
#plot and raster image on right
plot(c(0, 1), c(0, 1), type='n', axes=F, xlab='', ylab='')
axis(1, at=0.5, labels='S. constrictum (K = 0.95)', tick=F, line=-1, cex.axis=1.5)
rasterImage(constrictum, 0, 0, 1, 1) # modify image size by specifying different coordinates
It is possible to use the graphics window in an interactive way to obtain information contained within the plot. For instance, we have seen how we can import images into the graphics window, and we can also use the ‘locator’ function to identify points on those images and calculate their values.
# load library for reading 'png' image
library(png)
# load image data, prepare plot window and render in graphics window
# load image data
image <- readPNG('barplot1.png')
# set margins to zero on all sides
par(mar=rep(0, 4))
# new plot with x- and y-axis limits equalling width and height of image
plot(c(0, dim(image)[1]), c(0, dim(image)[2]), type='n')
# render image within plot
rasterImage(image, 0, 0, dim(image)[1], dim(image)[2], interpolate=F)
# locate mean values for 1999
means.1999 <- locator(7)
# locate upper CI for 1999
cis.1999 <- locator(7)
# locate range of y-axis
y0 <- locator(1)
y30 <- locator(1)
# show points that were clicked on
points(means.1999, pch=16)
points(cis.1999, pch=1)
points(y0, pch=15)
points(y30, pch=15)
# calculate actual means
30 * (means.1999$y - y0$y) / (y30$y - y0$y)
## [1] 24.45166 27.93661 26.49128 22.40729 16.75911 28.89699 17.60063
# calculate actual CIs
30 * (cis.1999$y - means.1999$y) / (y30$y - y0$y)
## [1] 2.087163 4.288431 1.198098 1.326466 1.621236 5.881141 2.263074
When making a scatter plot of a large dataset, a common problem is that many points are plotted on top of each other. This is sometimes referred to as ‘overplotting’. This is an issue because we no longer can see where most of the data occur, and visually this problem tends to make us think the variation in the data is larger than it actually is.
For this problem we use 18k observations of air temperature and air pressure measured at the HFE experiment on the Hawkesbury campus in 2008.
hfemet <- read.csv("HFEmet2008.csv")
hfemet <- subset(hfemet, AirPress > 98) # remove a couple of outliers
# Using a filled symbol makes the problem a bit worse, of course.
with(hfemet, plot(Tair, AirPress, pch=19))
It is impossible to see whether some values are more common than others, we just see a black circle (except for the edges).
The first approach, which is only partially successful, is to use transparent symbols. To some extent, the colour will be a function of how many points are overplotted. Unfortunately this approach is quite limited for larger datasets (but for smaller datasets with minor overplotting issues, it can be very effective to use transparent colours).
# for 'alpha'
library(scales)
with(hfemet, plot(Tair, AirPress, pch=16, col=alpha("red",0.3)))
A more effective approach is to colour the points by the local density of points in that region, using the convenient densCols function (base). Using this approach, we suddenly recognize a region of very high density - many points are located around 18C and air pressure of ca. 102.5.
# Setup vector of colours - based on a local density estimator.
cols <- with(hfemet, densCols(Tair, AirPress))
# Use the vectour of colours in a standard scatter plot.
with(hfemet, plot(Tair, AirPress, col=cols, pch=19))
The next approach is to use a ‘hexbin’ plot, which divides the plotting region into hexagonals, and counts the number of observations in each. The disadvantage of hexbin is that the grid graphics system is used, which is not compatible with base or ggplot. Thus it is difficult to adjust anything on the figure, and not possible to use it in conjunction with par, or layout, or basically anything you know.
library(hexbin)
h <- with(hfemet, hexbin(Tair, AirPress))
plot(h)
Finally, ggplot also has a hexbin built in, which we can invoke with stat_binhex, like so. Note the use of scale_fill_gradientn to set the colours (the default is not great).
ggplot(hfemet, aes(Tair, AirPress)) + stat_binhex() +
scale_fill_gradientn(colours=c("grey","red"))
It is straightforward in base R to make a scatter plot where points are colours by levels of a factor variable. It is however far from easy to do the same in a line plot, where points for a group are connected by a line. The solution in base R is to write somewhat verbose code, using a for loop, and adding one line at a time for each group.
The following example uses election poll data for a Dutch election in 2012. Note that the data are in wide format; poll percentages for each party are organized in different columns.
# Read the Dutch election poll data
election <- read.csv("dutchelection.csv")
election$Date <- as.Date(election$Date)
# Plot the first variable (make sure to set the Y axis range
# wide enough for all the other data!)
plot(VVD ~ Date, data=election, type='l', col="blue", ylim=c(0,30),
ylab="Poll result (%)")
palette(rainbow(12))
# Loop through columns
for(i in 2:ncol(election)){
# Find column name
party <- names(election)[i]
# Use column name to index data, and add a line
lines(election$Date, election[,party], col=palette()[i])
}
For line plots, ggplot2 is much easier to use - but we do have to reshape the data to long format, since ggplot does not do anything with wide format. In this case we can conveniently use gather from tidyr. Another major benefit is that ggplot adds a legend by default.
We also show how to set the colours of the line, to make it look more like the plot with base, and also to use a more simple theme, and switch off the grid lines.
library(tidyr)
# Make data into long format
elect_long <- gather(election, key="party",
value="vote_percent", -Date)
# Setting col=party in the aes statement will make a line for each party,
# with a different colour
ggplot(elect_long, aes(Date, vote_percent, col=party)) +
geom_path() + scale_colour_manual(values=rainbow(12)) +
ylim(c(0,30)) +
labs(x="Date", y="Poll result (%)") +
theme_bw() + # Black and white theme
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # No grid lines
In this example, we make grouped and stacked bar plots for data that are already tabulated. The example comes from Berkeley admissions data into the graduate school for 1973, where student numbers have already been tabulated, and separated into ‘admitted’, ‘denied’, for male and female students. The data as provided is already in wide format, in which case it is very convenient to use barplot directly to construct a stacked or grouped barplot.
berk <- read.csv("berkeley.csv")
berk
## Department Admitted_Male Denied_Male Admitted_Female Denied_Female
## 1 A 512 313 89 19
## 2 B 313 207 17 8
## 3 C 120 205 202 391
## 4 D 138 279 131 244
## 5 E 53 138 94 299
## 6 F 22 351 24 317
To make a stacked barplot, we first have to prepare a matrix with the data (a dataframe won’t do). Each column will be a bar, or group of bars, with each group defined by the rows. In this case, we want Admitted_Female and Denied_Female to be stacked.
m <- t(as.matrix(berk[,c("Admitted_Female","Denied_Female")]))
m
## [,1] [,2] [,3] [,4] [,5] [,6]
## Admitted_Female 89 17 202 131 94 24
## Denied_Female 19 8 391 244 299 317
# Plot matrix - makes a stacked barplot. We provide the names for the bars via `names.arg`
barplot(m, names.arg=berk$Department)
Try this yourself : You can make a grouped barplot by specifying beside=TRUE in the call to barplot.
The gplots package provides the barplot2 function which adds useful functionality like automatic legend placement, and somewhat nicer default colours (though not for just two groups as in this example).
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
barplot2(m, names.arg=berk$Department, legend=TRUE)
The following example shows how to use ggplot2 to make the same (well, similar) figure. The difficulty here is that ggplot2 can only data in long format. Normally this is not a problem, but in this case it makes life difficult and we have to use gather to make the data in long format. Also note the use of the “pipe operator” %>% to make a pipeline. The way this works is that the output from the first statement (reshape) is sent to the next statement (rename), where it is used as the first argument, and so on. This makes code shorter and often easier to read.
# for %>%, filter, gather and separate
library(dplyr)
library(tidyr)
# First make the dataframe into long format.
# The column names are now levels of a new factor variable (called 'variable' by default)
berklong <- gather(berk, key="variable", value="Nr_students", -Department)
# The next and final step is to separate the 'variable' into
# gender and status (admitted/denied)
berklong <- separate(berklong, "variable", c("status","gender"))
head(berklong)
## Department status gender Nr_students
## 1 A Admitted Male 512
## 2 B Admitted Male 313
## 3 C Admitted Male 120
## 4 D Admitted Male 138
## 5 E Admitted Male 53
## 6 F Admitted Male 22
Now we are ready to make the plot similar to the one above. In geom_bar, the argument stat="identity" indicates that the actual data in the dataframe will be used to draw the bars. This is needed because it is also possible to first tabulate (or otherwise summarize) the data in the call to geom_bar before plotting (see next Section).
# Take subset with only Female students (using filter from dplyr - can use subset here as well)
filter(berklong, gender == "Female") %>%
ggplot(aes(x=Department, y=Nr_students, fill=status)) +
geom_bar(stat="identity")
Once we have the data in the right format, it is also straightforward to make a multiple panel graph, split by gender, using facet_wrap.
ggplot(berklong, aes(x=Department, y=Nr_students, fill=status)) +
geom_bar(stat="identity") + facet_wrap(~gender)
Try this yourself : the order of the ‘stacking’ in the barplots is determined by the order of the levels of the factor variable used to colour the sections of the bar. Try reversing it with berklong$status <- factor(berklong$status, levels=c("Denied","Admitted"))
In the plots above, the default barplot is to stack the individual groups. Two options are useful as well, the first is to place each of the groups next to each other using position=position_dodge(). In this example we also change the colours of the groups with scale_fill_manual.
filter(berklong, gender == "Female") %>%
ggplot(aes(x=Department, y=Nr_students, fill=status)) +
geom_bar(stat="identity", position=position_dodge()) +
scale_fill_manual(values=c("forestgreen","red"))
And finally, for data like these, it is also useful to plot the admitted / denied data as percentages, by using position=position_fill(), and the use of the scales package to add percent labels on the y-axis.
library(scales) # for percent()
ggplot(berklong, aes(x=Department, y=Nr_students, fill=status)) +
geom_bar(stat="identity", position=position_fill()) +
scale_fill_manual(values=c("forestgreen","red")) +
labs(y="Percentage") +
facet_wrap(~gender) +
scale_y_continuous(labels = percent)
This example shows how to make bar plots of frequencies (counts) of data by groups, using base or ggplot2.
For base, we can use table to produce a table of counts, or tapply to calculate a simple statistic by a combination of factor variables. Since table and tapply both return a matrix by default, it is very convenient to use them in barplot to make a grouped barplot.
We here use the famous Titanic dataset, and sum the number of survivors and non-survivors by passenger class and sex.
titanic <- read.table("titanic.txt", header=TRUE)
head(titanic)
## Name PClass Age Sex
## 1 Allen, Miss Elisabeth Walton 1st 29.00 female
## 2 Allison, Miss Helen Loraine 1st 2.00 female
## 3 Allison, Mr Hudson Joshua Creighton 1st 30.00 male
## 4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25.00 female
## 5 Allison, Master Hudson Trevor 1st 0.92 male
## 6 Anderson, Mr Harry 1st 47.00 male
## Survived
## 1 1
## 2 0
## 3 0
## 4 0
## 5 1
## 6 1
Make table of sums of survived (Survived = 1) or perished (Survived = 0) by passenger class and sex.
titan_surv <- with(titanic, tapply(Survived, list(PClass, Sex), FUN=sum))
The number of survivors is not that interesting if we don’t know how many passengers were in each group. We can tabulate that with table :
titan_n <- with(titanic, table(PClass, Sex))
Now we can calculate the proportion survived by simply dividing titan_surv by titan_n - this works because both matrices are the same dimension (check this!).
titan_surv_prop <- titan_surv / titan_n
And we can use this to make our barplot (again using barplot2 from the gplots package). This time a stacked barplot makes no sense - we want the bars within a group to be next to each other.
barplot2(titan_surv_prop, beside=TRUE, legend=TRUE, ylab="Proportion survived", ylim=c(0,1))
box()
We can make the same plot using ggplot, as follows. To calculate the proportion survived, we use the dplyr package. First, set the grouping variables (PClass and Sex), and then summarize the data by these groupings to yield a table similar to the one above (but in long format). This can then be sent to ggplot.
library(dplyr)
surv_prop_2 <- group_by(titanic, PClass, Sex) %>%
summarize(surv_prop = sum(Survived) / n())
# the n() function comes with dplyr; a simple count function.
# Make plot
# Note use of 'position="dodge"' to place bars next to each other.
# Try switching Sex and PClass to see what happens to the plot.
ggplot(surv_prop_2, aes(Sex, surv_prop, fill=PClass )) +
geom_bar(stat="identity", position="dodge") +
labs(y = "Proportion survived")
This example shows how to customize plotting of timeseries, specifically a numeric vector against a vector of class Date or POSIXct (i.e. a date-time representation). For the POSIXct example, we will use the “hfemet2008.csv” dataset used in an example above, and first prepare the data.
hfemet <- read.csv("HFEmet2008.csv")
# Convert DateTime to POSIXct using lubridate
library(lubridate)
hfemet$DateTime <- mdy_hm(hfemet$DateTime)
Now plot the data. To customize the x-axis, use axes=FALSE to suppress the axes, and then add them in a customized way using axis.POSIXct. Date formatting strings can be found on ?strptime.
Try this yourself: First make the below plot with the default axes, to note that the labels on the X-axis are not placed for every month, but rather every two months (and ending in January though the data actually end in December).
with(hfemet, plot(DateTime, Tair, type='l', col="cornflowerblue", axes=FALSE, xlab=""))
axis(2) # y-axis
# Make vector of Date-times where you want tick marks to be placed:
d_maj <- seq(from=ymd_hm("2008-1-1 12:00"), to=ymd_hm("2009-1-31 12:00"), by="1 month")
axis.POSIXct(1, at=d_maj, format="%b")
box()
# Or, use month numbers:
#axis.POSIXct(1, at=d_maj, format="%m")
# Or, use month/year:
# axis.POSIXct(1, at=d_maj, format="%m/%y")
Try this yourself: Make an x-axis where the Date labels are placed every three months (by="3 months" in the seq statement), but tick marks are placed every month. To do this, use axis.POSIXct twice: once for the axis with labels (every 3 months), and once wihtout labels by setting labels=FALSE.
Similarly, we can customize the axis when plotting just a single day. Here the axis labels are set very small to avoid labels being chopped off.
par(cex.axis=0.7)
hfemet_jun1 <- subset(hfemet, as.Date(DateTime) == "2008-6-1")
with(hfemet_jun1, plot(DateTime, Tair, type='l',
xlab="Hour",
axes=FALSE))
d <- with(hfemet_jun1, seq(min(DateTime), max(DateTime), by="1 hour"))
axis.POSIXct(1, at=d, format="%H")
axis(2)
box()
Alternatively, we can use a tick mark for every hour, but only place labels every 3 hours. In addition, we make the tick marks for non-label ticksa bit shorter, by setting the tcl parameter (which is sent to par via axis.POSIXct).
Also note we extend the x-axis range a bit by adding 30*60 to the max(DateTime), which is 30 min since DateTime (POSIXct) is coded as number of seconds (since some baseline date).
with(hfemet_jun1, plot(DateTime, Tair, type='l',
xlab="",
xlim=c(min(DateTime), max(DateTime)+30*60),
axes=FALSE))
axis.POSIXct(1, at=d, format="%H", labels=FALSE, tcl=-0.2)
lab <- with(hfemet_jun1, seq(min(DateTime), max(DateTime)+30*60, by="3 hours"))
axis.POSIXct(1, at=lab, format="%H:%M", labels=TRUE)
axis(2)
box()
Making publication quality plots of predictions is one area where ggplot is superior to the base graphics system. The model predictions can be generated from a single line of code nested within a call to ggplot.
Here is an example showing predictions for a combination of continuous and categorical predictor variables. Photosynthesis was measured on plants growing under two CO\(_2\) conditions and exhibiting different transpiration rates.
# read in the data
eucgas <- read.csv("eucface_gasexchange.csv")
# make a plot showing how 'Photo' changes with 'Trmmol' for the two 'CO2' levels
ggplot(eucgas, aes(Trmmol, Photo, color=CO2)) +
# add a line and ribbon showing the fitted model and confidence interval
# here we used 'fullrange=T' so that both lines extended over the whole range
stat_smooth(method='lm', aes(fill=CO2), fullrange=T, show.legend=F) +
# add points representing the raw data
geom_point() +
# modify axis labels
ylab('Photosynthetic rate') + xlab('Transpiration rate') +
# change colours and modify legend
scale_fill_manual(values=c("blue", "red")) +
scale_color_manual(name=expression(CO[2]),
labels=c("Ambient", "Ambient +\n150 ppm"),
values=c("blue", "red")) +
# customise the theme
theme_bw()
Let’s return to the titanic data. We previously showed how we can produce barplots showing proportions of surviving passengers on the titanic. We also have information regarding the age of the passengers and can use this to estimate the probability of survival.
library(ggplot2)
# read the data and remove any missing values
titanic <- read.table('titanic.txt', header=T)
titanic <- titanic[complete.cases(titanic), ]
# fit a model and show that all three predictors are significant
# however, note that this isn't necessary to plot the result
m1 <- glm(Survived ~ Age + Sex + PClass, data=titanic, family=binomial)
car::Anova(m1)
## Analysis of Deviance Table (Type II tests)
##
## Response: Survived
## LR Chisq Df Pr(>Chisq)
## Age 28.454 1 9.595e-08 ***
## Sex 214.776 1 < 2.2e-16 ***
## PClass 100.445 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# make a plot showing how the probability of survival changes with age for each group
ggplot(titanic, aes(Age, Survived, color=Sex)) +
# add a line and ribbon showing the fitted model and confidence interval
stat_smooth(method='glm', method.args=list(family='binomial'),
alpha=0.2, size=1, aes(fill=Sex)) +
# add points representing the raw data to show the density of observations
# do this separately for males and females so that they don't overlap
geom_point(data=titanic[titanic$Sex=='female', ], position=position_nudge(y=0.01), alpha=0.2) +
geom_point(data=titanic[titanic$Sex=='male', ], position=position_nudge(y=-0.01), alpha=0.2) +
# add axis labels
xlab('Age') + ylab('Probability of survival') +
# produce separate plots for each passenger class, and customise the theme
facet_grid(. ~ PClass) + theme_bw()
Barplots with error bars can be produced by including two calls to stat_summary during the use of ggplot. The first calculates averages for each group while the second calculates limits associated with one standard error on either side of the mean. We can demonstrate this using the pupae data.
# read the data, change 'CO2_treatment' to factor and remove NAs
pupae <- read.csv('pupae.csv')
pupae$CO2_treatment <- factor(pupae$CO2_treatment)
pupae <- filter(pupae, !is.na(Frass))
# plot the data
ggplot(pupae, aes(x=T_treatment, y=Frass, fill=CO2_treatment)) +
# produce bars showing mean for each group
stat_summary(geom="bar", fun.y=mean, position=position_dodge(width=0.9)) +
# produce error bars showing SE for each group
stat_summary(geom="errorbar", fun.data=mean_se,
position=position_dodge(width=0.9), width=0.25) +
# change the axis labels
ylab('Frass (mg)') +
scale_x_discrete(name='Temperature', labels=c('ambient', 'ambient + 3°C')) +
# customise theme
theme_bw() +
# change the legend title, labels and position
scale_fill_manual(name=expression(CO[2]~treatment),
labels=c("280 ppm", "400 ppm"),
values=c("grey50", "grey80")) +
theme(legend.position='top')
Animations can be made in R by embedding a call to plot within a loop, but rendering time that is to slow or fast can make this a not very appealing option. Here we use the animation package to show how we can create animations in an export in gif format.
In this example we will take increasingly larger samples from the standard normal distribution and show how this increases the precision of our estimate of the population mean (i.e., smaller confidence interval) but does not affect the sample standard deviation.
# load the animation library
library(animation)
# create a function for calculating standard error
se <- function(vec){sd(vec)/sqrt(length(vec))}
# save loop as gif
saveGIF({
x <- c() # empty vector to store samples
for(i in seq(2, 500, 2)) { # set the sampling sequence
# sample ten more individuals from population
x <- c(x, rnorm(2, 0, 1))
# plot histogram with new sample
hist(x, xlim=c(-3, 3), freq=T,
breaks=seq(-10, 10, 0.2),
main=paste('n = ', i, sep=' '))
# calculate mean, sd, se
xm <- mean(x)
xsd <- sd(x)
xse <- se(x)
# add line for population mean (0)
abline(v=0, col='black',lty='dashed', lwd=2)
# add lines for sample mean (red), 95% CI (blue), SD (orange)
abline(v=c(xm, xm-2*xse, xm+2*xse, xm-xsd, xm+xsd),
col=c('red', 'blue', 'blue', 'orange', 'orange'),
lwd=2)
}
# set file name, interval between images (in milliseconds)
}, movie.name='norm.gif', interval=0.1)
Here is another example, this time producing barplots with data from the 2019 Ultra-Trail Australia 100km event. We use animation to show the progression of runners through each of the split points in the race and the finish point. First we need to manipulate the data from its original format (split times for each runner, in rows, at each point, in columns). The data were obtained from https://www.ultratrailaustralia.com.au/results/100km/2019[here].
# load libraries
library(tidyverse)
library(chron)
library(animation)
# read data
uta19 <- read.csv('rawData/2019.csv', stringsAsFactors=F)
# set the number of minutes between each time step
step.m <- 5
# clean the data and change from wide to long
splt <- uta19 %>%
# only columns with split times, plus runner number
select(Race.No, starts_with('Race.Split.to')) %>%
# convert text to number of minutes since race start
mutate_if(is.character, function(x){
X <- strsplit(x, ':')
X[sapply(X, length) != 3] <- NA
sapply(X, function(y)(sum(as.numeric(y) * c(3600, 60, 1)) / 60))
}) %>%
# wide to long
gather(key='split', value='time_minutes', -Race.No) %>%
# clean up labels for split points
mutate(split = gsub('^.+\\.\\.', '', split),
split = gsub('\\.$', '', split),
# make 'split' a factor, sort labels by time
# of fastest runner arriving at each split point
split = factor(split),
split = reorder(split, time_minutes, FUN=min, na.rm=T),
# assign arrival time to a bin indicating the time step
time_minutes_agg = ceiling(time_minutes/step.m) * step.m)
# number of runners
total <- length(unique(splt$Race.No))
# sequence to time steps evaluated (removing '0')
time_minutes_agg <- seq(from=0,
to=round(max(splt$time_minutes_agg, na.rm=T), -1),
by=step.m)[-1]
# account for some time steps with no arrivals
splt <- full_join(data.frame(time_minutes_agg), splt, by='time_minutes_agg')
# make list, each element a dataframe for one time step
splt <- split(splt, splt$time_minutes_agg)
# number of runners arriving at each split point for each time step
splt <- sapply(splt, function(x)table(x[complete.cases(x), 'split']))
# add a column for the starting time
splt <- cbind(matrix(rep(0, nrow(splt)), nrow=nrow(splt),
dimnames=list(rownames(splt), '0')), splt)
# add a row for the starting split point
splt <- rbind(matrix(c(total, rep(0, ncol(splt)-1)), ncol=ncol(splt),
dimnames=list('0km', colnames(splt))), splt)
# add '0' to the sequence of time points being plotted
time_minutes_agg <- c(0, time_minutes_agg)
Then we loop through each time bin and produce a plot of the results, writing the result to gif format. Bars that remain show individuals who did not reach the next split point.
# save loop as gif
saveGIF({
# summarise number of runners at each split point
# at the end of each time bin
for(i in 1:length(time_minutes_agg)){
# select subset for timepoint 'i'
temp <- as.matrix(splt[, 1:i])
# calculate numbers of arrivals at each point
# while also calculating departures from earlier points
arrive <- rev(diff(rev(rowSums(temp))))
arrive <- c(arrive, `100km`=total-sum(arrive))
# results to dataframe for ggplot
temp <- data.frame(split=names(arrive),
runners=as.numeric(arrive),
stringsAsFactors=F)
# convert 'split' to factor
temp$split <- factor(temp$split, levels=rownames(splt))
# create object to display time
hm <- time_minutes_agg[i] / 60
h <- floor(hm)
m <- round((hm - h) * 60)
# barplot of number of runners at each split
p <- temp %>% ggplot(aes(x=split, y=runners)) +
geom_bar(stat='identity') +
scale_x_discrete(drop=F) + # keep empty factor levels
ylim(c(0, total)) + # constant y-axis limits
labs(title = paste(h, 'hours,', m, 'minutes'),
x='Split point',
y='Number of runners')
# 'print' is necessary to plot ggplot objects with a loop
print(p)
}
# set file name, interval between images (in milliseconds)
}, movie.name='uta2019.gif', interval=0.1)